Off-Policy Evaluation for Human Feedback
Off-policy evaluation (OPE) is important for closing the gap between offline training and evaluation of reinforcement learning (RL): it estimates the performance and/or ranking of target (evaluation) policies using offline trajectories only. It can improve the safety and efficiency of data collection and policy testing procedures in situations where online deployments are expensive, such as healthcare. However, existing OPE methods fall short in estimating human feedback (HF) signals, as HF may be conditioned on multiple underlying factors and is only sparsely available, in contrast to agent-defined environmental rewards (used in policy optimization), which are usually determined by parametric functions or distributions. The nature of HF signals thus makes it challenging to extrapolate accurate OPE estimates. To resolve this, we introduce an OPE for HF (OPEHF) framework that revives existing OPE methods so that HF signals can be evaluated accurately. Specifically, we develop an immediate human reward (IHR) reconstruction approach, regularized by environmental knowledge distilled into a latent space that captures the underlying dynamics of both state transitions and the issuing of HF signals. Our approach has been tested in two real-world settings, adaptive neurostimulation and intelligent tutoring, as well as in a simulation environment (visual Q&A). Results show that it significantly improves the accuracy of HF signal estimation compared to directly applying (variants of) existing OPE methods.
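To make the OPE setting concrete, below is a minimal sketch of per-decision importance sampling (PDIS), a standard OPE estimator of the kind the abstract refers to: each offline reward is reweighted by the cumulative likelihood ratio between the target (evaluation) policy and the behavior policy that collected the data. The function name and toy data are illustrative assumptions, not code from the paper.

```python
import numpy as np

def pdis_estimate(rewards, target_probs, behavior_probs, gamma=0.99):
    """Per-decision importance sampling estimate of a target policy's value.

    rewards:        list of per-trajectory reward lists
    target_probs:   per-step probability of each logged action under the
                    target (evaluation) policy
    behavior_probs: per-step probability of the same action under the
                    behavior (data-collecting) policy
    """
    values = []
    for r_traj, pi_e, pi_b in zip(rewards, target_probs, behavior_probs):
        rho, v = 1.0, 0.0  # cumulative importance weight, trajectory value
        for t, r in enumerate(r_traj):
            rho *= pi_e[t] / pi_b[t]      # reweight rewards up to step t
            v += (gamma ** t) * rho * r
        values.append(v)
    return float(np.mean(values))

# Toy usage: two 3-step trajectories with made-up probabilities and rewards.
rewards = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
pi_e = [[0.9, 0.5, 0.8], [0.4, 0.7, 0.6]]
pi_b = [[0.5, 0.5, 0.5]] * 2
print(pdis_estimate(rewards, pi_e, pi_b))
```

The core difficulty the paper targets is visible here: PDIS needs a per-step reward at every step, which environmental rewards provide but sparse, trajectory-level HF does not.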
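On the same assumption-laden footing, here is one way the described IHR reconstruction could be sketched: an encoder distills states into a latent space, a dynamics head regularizes that latent space to stay consistent with state transitions, and a reward head emits per-step immediate human rewards whose sum is fit to the sparse, trajectory-level HF signal. The architecture, loss weights, and names below are hypothetical; the abstract does not specify the paper's actual model.

```python
import torch
import torch.nn as nn

class IHRReconstructor(nn.Module):
    """Hypothetical sketch: reconstruct immediate human rewards (IHR) from a
    latent space regularized by environment dynamics."""

    def __init__(self, state_dim, action_dim, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, latent_dim))
        # Predicts the next latent state from (latent, action): this plays
        # the role of the "environmental knowledge" regularizer.
        self.dynamics = nn.Linear(latent_dim + action_dim, latent_dim)
        # Predicts a per-step immediate human reward from (latent, action).
        self.reward_head = nn.Linear(latent_dim + action_dim, 1)

    def forward(self, states, actions):
        z = self.encoder(states)                 # (T, latent_dim)
        za = torch.cat([z, actions], dim=-1)
        return z, self.dynamics(za), self.reward_head(za).squeeze(-1)

def loss_fn(model, states, actions, hf_total, dyn_weight=0.1):
    z, z_next_pred, ihr = model(states, actions)
    # HF is sparse: only a trajectory-level total is observed, so the
    # reconstructed per-step rewards are fit through their sum.
    recon = (ihr.sum() - hf_total) ** 2
    # Latent dynamics consistency: predicted next latents should match the
    # encoded latents of the actual next states (targets detached).
    dyn = ((z_next_pred[:-1] - z[1:].detach()) ** 2).mean()
    return recon + dyn_weight * dyn

# Toy usage with random data (10 steps, 8-dim states, 2-dim actions).
model = IHRReconstructor(state_dim=8, action_dim=2)
states, actions = torch.randn(10, 8), torch.randn(10, 2)
loss_fn(model, states, actions, hf_total=torch.tensor(3.0)).backward()
```

Once trained, the reconstructed per-step IHRs could stand in for the missing dense rewards in an off-the-shelf OPE estimator such as the PDIS sketch above.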